Parallel algorithms Parallel and Distributed Computing Wrocław, 07.05.2010 Paweł Duda
Parallel algorithm – definition: A parallel algorithm is an algorithm that has been specifically written for execution on a computer with two or more processing units.
Parallel algorithms can also be run on computers with a single processor (one with multiple functional units, pipelined functional units, or pipelined memory systems).
Modelling algorithms 1: When designing an algorithm, take into account the cost of communication and the number of processors (efficiency). The designer usually works with an abstract model of computation called the parallel random-access machine (PRAM). In this model each CPU operation counts as one step. The model's advantage is that the designer can ignore the details of the machine the algorithm is executed on.
Modelling algorithms 2 (PRAM): The model neglects issues such as synchronisation and communication. There is no limit on the number of processors in the machine. Any memory location is uniformly accessible from any processor. There is no limit on the amount of shared memory in the system.
Modelling algorithms 3 (PRAM): There are no conflicts in accessing resources. Programs written for these machines are generally MIMD (multiple instruction, multiple data).
Multiprocessor model
Parallel Algorithms Multiprocessor model
Work-depth model: How can the cost of an algorithm be calculated? Work W is the total number of operations. Depth D is the longest chain of dependencies. P = W/D is the parallelism of the algorithm. Picture: summing 16 numbers on a tree. The total depth (longest chain of dependencies) is 4 and the total work (number of operations) is 15.
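The tree summation in the picture can be reproduced with a short sketch. The function name `tree_sum` and the choice of the numbers 1..16 as input are assumptions made for illustration; the slide only states the resulting work and depth.

```python
def tree_sum(values):
    """Sum values pairwise, level by level; return (total, work, depth)."""
    level = list(values)
    if not level:
        return 0, 0, 0
    work = 0   # number of additions performed in total
    depth = 0  # number of levels, i.e. the longest chain of dependent additions
    while len(level) > 1:
        next_level = []
        # All additions within one level are independent of each other,
        # so a parallel machine could perform them in a single step.
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])
            work += 1
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd element is carried up unchanged
        level = next_level
        depth += 1
    return level[0], work, depth


total, work, depth = tree_sum(range(1, 17))  # 16 numbers, as in the picture
parallelism = work / depth                   # P = W / D
print(total, work, depth, parallelism)
```

For 16 inputs this gives W = 15 and D = 4, so P = 15/4 = 3.75: on average fewer than four processors can be kept busy, even though the first level alone offers eight independent additions.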
Mergesort: Conceptually, a merge sort works as follows. Input: a sequence of n keys. Output: the sorted sequence of n keys. If the list is of length 1, it is already sorted. Otherwise: (1) divide the unsorted list into two sublists of about half the size; (2) sort each sublist recursively by re-applying merge sort; (3) merge the two sublists back into one sorted list.
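The steps above can be sketched as follows. This is a plain sequential version. The slides do not give a specific parallel scheme, so the comments only mark where independent work appears. The names `merge` and `merge_sort` are illustrative.

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two tails is non-empty
    merged.extend(right[j:])
    return merged


def merge_sort(keys):
    """Return a new sorted list containing the given keys."""
    if len(keys) <= 1:
        return list(keys)          # length 0 or 1: already sorted
    mid = len(keys) // 2
    # The two recursive calls touch disjoint sublists, so a parallel
    # version could run them as separate tasks.
    left = merge_sort(keys[:mid])
    right = merge_sort(keys[mid:])
    return merge(left, right)


print(merge_sort([5, 2, 9, 1, 5, 6]))
```

In this simple form the final merge is still sequential, so it bounds the depth of the algorithm from below even if both halves are sorted concurrently.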
Mergesort
General-purpose computing on graphics processing units (GPGPU): A recent trend. GPUs are used as co-processors. Typical workloads are linear algebra matrix operations. Example: Nvidia's Tesla GPGPU card.
Matrix multiplication: Algorithm MATRIX_MULTIPLY(A, B)
    (l, m) := dimensions(A)
    (m, n) := dimensions(B)
    in parallel for i ∊ [0..l) do
        in parallel for j ∊ [0..n) do
            R_ij := sum({ A_ik * B_kj : k ∊ [0..m) })
We need log n matrix multiplications, each taking time O(n^3). The serial complexity of this procedure is O(n^3 log n).
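A minimal sketch of the MATRIX_MULTIPLY pseudocode, assuming matrices are non-empty lists of rows. It parallelises only over the row index i (one task per row of R), whereas the pseudocode parallelises over both i and j. The use of a thread pool is an illustrative choice, not something the slides prescribe.

```python
from concurrent.futures import ThreadPoolExecutor


def matrix_multiply(a, b):
    """Return R = A * B, computing each row of R as an independent task."""
    l, m = len(a), len(a[0])
    if len(b) != m:
        raise ValueError("inner dimensions of A and B do not match")
    n = len(b[0])

    def compute_row(i):
        # R_ij = sum over k of A_ik * B_kj; no row depends on any other row.
        return [sum(a[i][k] * b[k][j] for k in range(m)) for j in range(n)]

    with ThreadPoolExecutor() as pool:
        # map() returns the rows in index order, regardless of completion order.
        return list(pool.map(compute_row, range(l)))


print(matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In standard CPython, threads usually do not speed up pure-Python arithmetic like this. The sketch is meant to show the independence structure, not to deliver a measurable speedup.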
Search: Tasks and channels are created dynamically during program execution. The search looks for nodes corresponding to 'solutions'. Initially, a task is created for the root of the tree.
    procedure search(A)
    begin
        if (solution(A)) then
            score = eval(A)
            report solution and score
        else
            foreach child A(i) of A
                search(A(i))
            endfor
        endif
    end
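The search procedure can be sketched sequentially as below. The `Node` class, the rule that every leaf counts as a solution, and the use of the node value as its score are all hypothetical stand-ins for the problem-specific `solution` and `eval` of the pseudocode.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    value: int
    children: list = field(default_factory=list)


def is_solution(node):
    # Hypothetical criterion: a node without children is a solution.
    return not node.children


def evaluate(node):
    # Hypothetical score: the value stored in the node.
    return node.value


def search(node, reports):
    """Depth-first search that appends (value, score) for every solution."""
    if is_solution(node):
        reports.append((node.value, evaluate(node)))
    else:
        for child in node.children:
            # In the parallel formulation a new task would be created here
            # for each child, with a channel to send solutions back.
            search(child, reports)


tree = Node(0, [Node(1, [Node(3), Node(4)]), Node(2)])
found = []
search(tree, found)
print(found)
```

Here the `reports` list plays the role of the channels: it is the single place where solutions are collected.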
Shortest-Path Algorithms: The all-pairs shortest-path problem involves finding the shortest path between all pairs of vertices in a graph. A graph G = (V, E) comprises a set V of N vertices {v_i} and a set E ⊆ V × V of edges. In a directed graph, the edges (v_i, v_j) and (v_j, v_i), i ≠ j, are distinct. Picture: a simple directed graph, G, and its adjacency matrix, A.
Floyd’s algorithm: Floyd’s algorithm is a graph analysis algorithm for finding shortest paths in a weighted graph. A single execution of the algorithm will find the shortest paths between all pairs of vertices.
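A minimal sequential sketch of Floyd's algorithm, assuming the graph is given as an adjacency matrix with 0 on the diagonal and infinity where there is no edge. The four-vertex example graph is made up for illustration and is not the one from the slides.

```python
INF = float("inf")


def floyd(adjacency):
    """Return the matrix of shortest-path lengths between all vertex pairs."""
    n = len(adjacency)
    dist = [row[:] for row in adjacency]  # work on a copy of the input
    for k in range(n):
        # After step k, dist[i][j] is the shortest path from i to j that
        # uses only vertices 0..k as intermediate points.
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


example = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
for row in floyd(example):
    print(row)
```

The outer loop over k must run in order, while for a fixed k the updates for different (i, j) are independent, which is what the parallel versions exploit.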
Parallel Floyd’s algorithm 1: The first parallel Floyd algorithm is based on a one-dimensional, rowwise domain decomposition of the intermediate matrix I and the output matrix S. The algorithm can use at most N processors. Each task has one or more adjacent rows of I and is responsible for performing computation on those rows.
Parallel Floyd’s algorithm 1, picture: Parallel version of Floyd's algorithm based on a one-dimensional decomposition of the I matrix. In (a), the data allocated to a single task are shaded: a contiguous block of rows. In (b), the data required by this task in the kth step of the algorithm are shaded: its own block and the kth row.
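The rowwise decomposition can be simulated sequentially to show which data each task touches. In this sketch the function name `floyd_rowwise`, the `num_tasks` parameter and the example graph are assumptions. The tasks are run one after another instead of on separate processors, and the copy of row k stands in for the communication step.

```python
INF = float("inf")


def floyd_rowwise(adjacency, num_tasks):
    """Floyd's algorithm with rows split into contiguous blocks, one per task."""
    n = len(adjacency)
    num_tasks = max(1, min(num_tasks, n))  # at most N tasks, one row each
    # Task t owns rows bounds[t][0] .. bounds[t][1] - 1.
    bounds = [(t * n // num_tasks, (t + 1) * n // num_tasks)
              for t in range(num_tasks)]
    dist = [row[:] for row in adjacency]
    for k in range(n):
        row_k = dist[k][:]  # the only non-local data a task needs in step k
        for lo, hi in bounds:
            # Work of one task: it reads and writes only its own rows plus row_k.
            for i in range(lo, hi):
                d_ik = dist[i][k]
                for j in range(n):
                    if d_ik + row_k[j] < dist[i][j]:
                        dist[i][j] = d_ik + row_k[j]
    return dist


example = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
print(floyd_rowwise(example, 2))
```

Taking a snapshot of row k at the start of each step is safe as long as the diagonal entries are non-negative (no negative cycles), because row k and column k then do not change during step k.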
Parallel Floyd’s algorithm 2: An alternative parallel version of Floyd's algorithm uses a two-dimensional decomposition of the various matrices. This version allows the use of up to N^2 processors.
Parallel Floyd’s algorithm 2, picture: Parallel version of Floyd's algorithm based on a two-dimensional decomposition of the I matrix. In (a), the data allocated to a single task are shaded: a contiguous submatrix. In (b), the data required by this task in the kth step of the algorithm are shaded: its own block, and part of the kth row and column.
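The two-dimensional decomposition can be simulated in the same sequential way. The function name `floyd_blocks`, the grid size parameter `q` (a q-by-q grid of tasks) and the example graph are assumptions for illustration; with q = N every task owns a single matrix element, which corresponds to the N^2 upper bound.

```python
INF = float("inf")


def floyd_blocks(adjacency, q):
    """Floyd's algorithm with the matrix split into a q-by-q grid of blocks."""
    n = len(adjacency)
    q = max(1, min(q, n))
    cuts = [t * n // q for t in range(q + 1)]  # block boundaries on both axes
    dist = [row[:] for row in adjacency]
    for k in range(n):
        row_k = dist[k][:]                      # kth row at the start of step k
        col_k = [dist[i][k] for i in range(n)]  # kth column at the start of step k
        for bi in range(q):
            for bj in range(q):
                # Task (bi, bj) updates only its own submatrix and needs just
                # the parts of row_k and col_k that line up with that block.
                for i in range(cuts[bi], cuts[bi + 1]):
                    for j in range(cuts[bj], cuts[bj + 1]):
                        via_k = col_k[i] + row_k[j]
                        if via_k < dist[i][j]:
                            dist[i][j] = via_k
    return dist


example = [
    [0,   3,   INF, 7],
    [8,   0,   2,   INF],
    [5,   INF, 0,   1],
    [2,   INF, INF, 0],
]
print(floyd_blocks(example, 2))
```

Compared with the rowwise version, each task needs less data per step (a segment of one row and one column instead of a whole row), at the price of communication along both dimensions of the task grid.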
Thank you for your attention

Editor's Notes

  • #4: A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.
  • #5: RAM is the corresponding model for sequential algorithms. A CPU step is an operation such as a logical operation, a memory access, or an arithmetic operation. Model's advantages: the algorithm's designer can ignore the details of the machine the algorithm is executed on.
  • #7: MIMD (Multiple Instruction, Multiple Data)
  • #9: 1) Local: a set of n processors, each with its own local memory. The processors are connected to a common communication network. A processor can access its own memory directly, but it can also access another processor's memory by first requesting it. 2) Modular: a) typically the modules (processors and memory) are arranged so that access to memory is uniform for all processors; b) the access time depends on the communication network and the memory access pattern. 3) PRAM: a) a processor can access any word of memory in a single step; b) it is just a model.
  • #10: MIMD (Multiple Instruction, Multiple Data)
  • #13: General-purpose computing on graphics processing units (GPGPU) is a fairly recent trend in computer engineering research. GPUs are co-processors that have been heavily optimized for computer graphics processing. Computer graphics processing is a field dominated by data parallel operations — particularly linear algebra matrix operations.
  • #15: Each circle represents a node in the search tree which is also a call to the search procedure. A task is created for each node in the tree as it is explored. At any one time, some tasks are actively engaged in expanding the tree further (these are shaded in the figure); others have reached solution nodes and are terminating, or are waiting for their offspring to report back with solutions. The lines represent the channels used to return solutions.  
  • #16: We conclude this chapter by using performance models to compare four different parallel algorithms for the all-pairs shortest-path problem. This is an important problem in graph theory and has applications in communications, transportation, and electronics problems. It is interesting because analysis shows that three of the four algorithms can be optimal in different circumstances, depending on tradeoffs between computation and communication costs.